Language-Independent Text Parsing of Arbitrary HTML-Documents. Towards A Foundation For Web Genre Identification
Abstract
This article describes an approach to parsing and processing arbitrary web pages in order to detect macrostructural objects such as headlines, explicitly and implicitly marked lists, and text blocks of different types. The text parser analyses a document in several processing stages and inserts the analysis results directly into the DOM tree in the form of XML elements and attributes, so that both the original HTML structure and the determined macrostructure are available at the same time for secondary processing steps. This text parser is being developed for a novel kind of search engine that aims to classify web pages into web genres, so that the search engine user will be able to specify one or more keywords as well as one or more web genres of the documents to be found.
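The idea of writing analysis results back into the tree can be illustrated with a minimal sketch. It is not the article's parser: the attribute name `macro`, the label values, and the bullet heuristic are all hypothetical, and the sample is well-formed markup so that Python's standard `xml.etree.ElementTree` can read it (arbitrary web pages would need a tolerant HTML parser first).

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical attribute name and labels; the abstract does not
# specify the vocabulary the real parser uses.
MACRO_ATTR = "macro"

# Assumed heuristic: a paragraph that starts with a dash, asterisk,
# bullet character, or "1." / "1)" is treated as an implicit list item.
BULLET = re.compile(r"^\s*(?:[-*\u2022]|\d+[.)])\s+")


def annotate(root):
    """Label elements in place; the original markup stays untouched."""
    for elem in root.iter():
        tag = elem.tag.lower()
        if re.fullmatch(r"h[1-6]", tag):
            elem.set(MACRO_ATTR, "headline")
        elif tag in ("ul", "ol"):
            elem.set(MACRO_ATTR, "explicit-list")
        elif tag == "p" and BULLET.match(elem.text or ""):
            elem.set(MACRO_ATTR, "implicit-list-item")
        elif tag == "p":
            elem.set(MACRO_ATTR, "text-block")
    return root


SAMPLE = """<html><body>
<h1>News</h1>
<p>Plain paragraph.</p>
<p>- first point</p>
<p>- second point</p>
<ul><li>a</li></ul>
</body></html>"""

root = annotate(ET.fromstring(SAMPLE))

# Collect (tag, label) pairs in document order for inspection.
labels = [(e.tag, e.get(MACRO_ATTR)) for e in root.iter() if e.get(MACRO_ATTR)]
print(labels)
```

Because the labels live on the same nodes as the HTML tags, a later processing step can query either layer, for example selecting every element whose `macro` attribute equals `headline` regardless of which tag produced it.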
Similar resources
Towards Automatic Web Genre Identification: A Corpus-Based Approach in the Domain of Academia by Example of the Academic's Personal Homepage
We argue for a systematic analysis of one particular, well structured domain—academic Web pages—with regard to a special class of digital genres: Web genres. For this purpose, we have developed a database-driven system that will ultimately consist of more than 3 000 000 HTML documents, written in German, which are the empirical basis for our research. We introduce the notions of Web genre type ...
Concurrent programming on the web with Webstream
We describe Webstream, a language to simplify the development of client-side web applications, particularly web-aware information agents. Webstream encapsulates web documents as streams of messages passing between concurrent lightweight threads, permitting operations to be carried out lazy-evaluation style while documents are in the process of being retrieved. Streams can be pipelined through f...
ThesWB: A Tool for Thesaurus Construction from HTML Documents
Electronically available documents on the Web are exploding at an ever-increasing rate. Many Web documents, however, contain rich knowledge that describes the document's content. The Web can be viewed as a body of text containing two fundamentally different types of data: the contents and the tags. A tag is in HTML (Hyper-Text Markup Language) meta-data describing the layout and linking structu...
Exploring Impacts of Consciousness-raising in a Genre-based Pedagogy
This study reports on the findings of a genre teaching course for developing academic writing of a class of EFL students in Iran. The information report genre was taught in a cyclical way of teaching and learning, which was started from ‘setting the context’ and ‘deconstruction’ of prototype information report genre, and continued with ‘joint construction’, ‘independent construction’, and final...
Complementary Approaches to Representing Differences Between Structured Documents
Structured documents Documents can be represented as structures with a hierarchical arrangement of text and non-text nodes, where nodes are labelled by category names such as “paragraph” and “section”. Representing documents this way is a natural consequence of using the Standard Generalized Markup Language (SGML) to encode the content and form of documents [10, 11, 7]. SGML is widely used. HTM...
Journal: LDV Forum
Volume: 20
Publication date: 2005